
T128 refactor loader #128, #122 #138

Merged
merged 28 commits into from
Aug 27, 2021

Conversation

@dolsysmith (Contributor) commented Aug 13, 2021

Summary

This branch contains the following changes:

  1. Upgrades Spark to v. 3.1.2
  2. Upgrades Java to v. 11.11+9
  3. Upgrades Python to 3.8 (and other dependencies as required)
  4. Upgrades Pyspark to 3.1.2
  5. spark-loader command now uses the Spark DataFrame API to 1) load Tweets into ES and 2) create extracts.
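For reviewers unfamiliar with the connector, a DataFrame write to Elasticsearch via elasticsearch-hadoop generally looks like the sketch below. The index name, host, and mapping-id column are illustrative placeholders, not the exact values used in tweetset_loader.py.

```python
# Sketch of a DataFrame write to Elasticsearch via elasticsearch-hadoop.
# Host, index, and column names here are placeholders, not this branch's values.
def es_write_options(nodes, index, mapping_id="tweet_id"):
    """Build the option dict for an elasticsearch-hadoop DataFrame write."""
    return {
        "es.nodes": nodes,
        "es.resource": index,
        "es.mapping.id": mapping_id,  # use the tweet id as the ES document id
    }

# Usage (requires a running SparkSession and ES cluster):
# (df.write
#    .format("org.elasticsearch.spark.sql")
#    .options(**es_write_options("elasticsearch:9200", "tweets"))
#    .mode("append")
#    .save())
```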

Setup

You will need an NFS mount shared between your primary and cluster nodes/VMs for this branch to work properly. See the instructions in the original issue.

Make sure that loader.docker-compose.yml is configured to build the Docker image from Dockerfile-loader. Likewise for the spark-master and spark-worker containers in docker-compose.yml on both primary and cluster nodes.

You'll also need to update your .env file to point the DATASET_PATH variable to the shared NFS mount. (On my VMs, this is /storage/dataset_loading.)

Currently, the data extracts are written to the same directory used for loading datasets, so they will not appear in the TweetSets UI, which still looks for them elsewhere. But I didn't want to touch tweetset_server.py in this branch, given the work Laura has been doing on the Python 3.8 upgrade.

Testing

It will be useful to load the same dataset with the regular loader and then with the Spark loader, in order to compare results in the UI.

  1. Bring up TweetSets (making sure it builds the new images for Dockerfile-spark).
  2. Create a dataset.json file in the directory with the JSON of the tweets to load.
  3. Bring up the loader container.
  4. Load a sample dataset using the regular (non-Spark) loader.
  5. Edit the dataset.json with a name to distinguish it from the previous load.
  6. Load the dataset again with the following command:
spark-submit \
 --jars elasticsearch-hadoop.jar \
 --master spark://$SPARK_MASTER_HOST:7101 \
 --py-files dist/TweetSets-2.1.0-py3.8.egg,dependencies.zip \
 --conf spark.driver.bindAddress=0.0.0.0 \
 --conf spark.driver.host=$SPARK_DRIVER_HOST \
 tweetset_loader.py spark-create /dataset/sample/json

This presumes that your sample dataset is in the /storage/dataset_loading/sample directory (or whatever NFS mount is mapped to /dataset in the .env file). In my testing, I put the tweet JSON files and dataset.json in a json subdirectory, but that's not strictly necessary.

  7. Unfortunately, the UI will not work on this branch, due to the changes required for Python 3.8. So to test it in the UI, you should:
    a. Bring down TweetSets on both clusters.
    b. Check out the master branch or Laura's Py38 branch (t126-python-38).
    c. Remove the server image (on the primary node) to force a rebuild: docker image rm ts_server-flaskrun:latest
    d. Bring TweetSets back up.

Expected Results

Elasticsearch indexing

  • Spark loader should load the data and create extracts without errors.
  • Datasets limited in TweetSets should mostly be identical. The following are expected differences:
    • More hashtags may be present in this implementation, due to more consistent use of the extended_tweet and retweeted_status elements.
    • More URLs may be present, for the same reasons as above.
    • Fewer results for (at least some) keyword searches. This is due to the exclusion of the quoted_status text fields for tweets of type quote. (This decision was made in consultation with Laura for the sake of consistency.)

Let me know if you see inconsistencies in the indexing that don't make sense with the above.

Dataset extracts

  • Spark automatically writes each extract type to its own subdirectory. The following should be present:
    • tweet_json
    • tweet_ids
    • tweet_csv
    • tweet_mentions/top_mentions
    • tweet_mentions/nodes
    • tweet_mentions/edges

Each should contain one or more zipped files. I've tested the CSV files against those created by twarc.json2csv (1.12.1) and documented some minor differences in the data dictionary. Feel free to test them against full extracts created in the UI, but do keep in mind that the current version of TS is using an older version of json2csv.

In my testing, Spark created far too many files for the mention extracts. I assume that is a setting that can be configured, but I haven't looked into that yet.
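For what it's worth, the number of output files corresponds to the number of partitions at write time, so a coalesce before the write is the usual fix. A rough sketch follows; the partition-sizing heuristic is my own suggestion, not code from this branch, and `est_bytes` is a hypothetical size estimate.

```python
import math

def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Pick a partition count aiming for ~128 MB per output file,
    a common Spark rule of thumb. Heuristic only; not from this branch."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Usage under Spark (sketch; est_bytes would come from the input dataset size):
# mentions_df.coalesce(target_partitions(est_bytes)) \
#     .write.option("compression", "gzip").csv(out_path)
```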

Performance

I am curious to hear how you experience it. I would expect this code to be faster at least for large datasets -- it was so in testing on my own laptop -- but I haven't had a chance to test a large dataset on the VMs. There may be Spark settings we can tweak to improve performance, but my impression is that for some of these, what's optimal depends on the environment, and our dev VMs are not terribly good proxies for production. We might have more success testing this aspect on tweetsets-dev.

@dolsysmith dolsysmith requested review from lwrubel and kerchner August 13, 2021 15:44
@dolsysmith dolsysmith self-assigned this Aug 13, 2021
@dolsysmith dolsysmith linked an issue Aug 13, 2021 that may be closed by this pull request
@dolsysmith dolsysmith force-pushed the t128-refactor-loader branch from 674c0af to 0739b6f on August 19, 2021 14:44
@dolsysmith (Contributor, Author)

Updated spark_utils.py so that the original JSON (unparsed) is stored in the tweet field. This method avoids discrepancies arising from the difference between Python's and Spark's handling of null fields in the original tweet JSON. The short version is that some fields in Twitter's JSON schema are present with null values, and others can be absent entirely. Spark, being schema-based, will treat absent fields and fields present but with a null value the same. Python, on the other hand, will look for the presence of a key, and according to the logic in json2csv, sometimes it will behave differently depending on whether a key is present (but empty) or absent entirely. This change should obviate this problem, as well as preserve the original structure of the Twitter API JSON.
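A minimal pure-Python illustration of the null-vs-absent distinction described above (field names are illustrative):

```python
import json

# Two tweets: one with a key present but null, one with the key absent.
present_null = '{"id_str": "1", "coordinates": null}'
absent = '{"id_str": "2"}'

# Parsed with Python, the two cases are distinguishable:
assert "coordinates" in json.loads(present_null)   # key present, value None
assert "coordinates" not in json.loads(absent)     # key missing entirely

# A schema-based reader such as Spark's would surface both rows with
# coordinates = null, erasing that distinction. Storing the raw line
# unparsed in the tweet field keeps the original shape byte-for-byte:
raw_tweet_field = present_null  # stored exactly as received
assert '"coordinates": null' in raw_tweet_field
```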

@dolsysmith (Contributor, Author)

Updated the JSON schema file so that the full_text and retweeted_status.full_text fields are read (if present) by the loader.

spark_utils.py Outdated
'''Loads a set of JSON tweets as strings and adds a column index. We do this so that the ultimate output in the JSON extract will have the same null fields as the original.

:param spark: an initialized SparkSession object
:param path_to_dataset: a comma-separated list of JSON files to load'''
Collaborator:
Should this be :param path_to_tweets:?

Contributor Author:

Thanks -- fixed.

@dolsysmith (Contributor, Author)

Updated branch as follows:

  1. Writing to ES (tweetset_loader.py, line 354) now uses only the subset of fields we want to index.
  2. elasticsearch-hadoop configuration updated to exclude tweet_id field from indexing.
  3. JSON object of tweet now joined to parsed tweet by unique tweet id. (I was joining on a row number previously, but such numbering is evidently not determinate when loading from multiple files. In other words, loading the same dataset twice can produce different row numberings if the data are distributed across multiple files.)

@dolsysmith (Contributor, Author)

dolsysmith commented Aug 25, 2021

New approach: we load the JSON-L as an RDD, then convert that to a DataFrame, allowing us to preserve the original string representation of the JSON as a separate column and obviating the need for a join (which creates problems on smaller datasets). This approach seems stable, but performance has taken a hit (relative to previous implementations) that is evident on larger datasets. Loading 20 GB with the new implementation (not counting the time to create extracts) took ~35 min, vs. 20 min using the implementation currently in production.
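A rough sketch of this approach; the function and variable names are mine, not necessarily those in spark_utils.py:

```python
import json
# from pyspark.sql import SparkSession  # available in the loader container

def line_to_pair(line):
    """Pair the raw JSON-L line with its tweet id, keeping the original
    string untouched so the JSON extract round-trips exactly.
    Sketch only; field handling in spark_utils.py may differ."""
    return {"tweet_id": json.loads(line)["id_str"], "tweet": line}

# Usage under Spark (sketch):
# rdd = spark.sparkContext.textFile(path_to_tweets).map(line_to_pair)
# df = spark.createDataFrame(rdd)  # raw string preserved in the tweet column
```

Keeping the raw string in the same record from the start is what makes the previous id-based join unnecessary.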

@lwrubel (Collaborator)

lwrubel commented Aug 25, 2021

Is that comparison against the previous implementations of Spark 3? Or compared to TweetSets 2.1?

@lwrubel (Collaborator)

lwrubel commented Aug 25, 2021

Oh never mind, I see you said compared to production. Sorry!

@dolsysmith (Contributor, Author)

This version uses the RDD API for loading to TweetSets (in order to preserve the original JSON as is) and the DataFrame API to create the extracts. Performance is comparable to what's in production for loading and significantly improved for creating extracts.

The Spark SQL code includes fields that we use for indexing in Elasticsearch; these are dropped when creating the CSV. I will leave them there for now (I don't think their presence really impacts performance) with an eye toward a future release where we no longer need to load the full JSON into Elasticsearch. At that point, we can use the DataFrame API for everything (which should improve performance further).
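The drop-before-CSV step can be as simple as the sketch below; the column names are hypothetical, not the actual ones in this branch.

```python
# Hypothetical list of columns used only for Elasticsearch indexing,
# not the actual column names in this branch.
INDEX_ONLY_COLUMNS = ["tweet", "text_all", "hashtags_all"]

def csv_columns(all_columns, index_only=INDEX_ONLY_COLUMNS):
    """Return the columns that belong in the CSV extract."""
    return [c for c in all_columns if c not in index_only]

# Usage under Spark (sketch):
# csv_df = df.select(*csv_columns(df.columns))
# csv_df.write.option("header", True).csv(out_path)
```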

@dolsysmith dolsysmith merged commit 5229297 into master Aug 27, 2021
Development

Successfully merging this pull request may close these issues.

Refactor tweetset_loader.py to use Spark DataFrame API